{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3c1fc107-77e9-48d8-8054-ca376a8d7317",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T03:57:26.504495Z",
     "iopub.status.busy": "2024-07-14T03:57:26.502860Z",
     "iopub.status.idle": "2024-07-14T03:57:36.908728Z",
     "shell.execute_reply": "2024-07-14T03:57:36.906437Z",
     "shell.execute_reply.started": "2024-07-14T03:57:26.504437Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'translation-agent'...\n",
      "remote: Enumerating objects: 226, done.\u001b[K\n",
      "remote: Counting objects: 100% (75/75), done.\u001b[K\n",
      "remote: Compressing objects: 100% (42/42), done.\u001b[K\n",
      "remote: Total 226 (delta 48), reused 41 (delta 31), pack-reused 151\u001b[K\n",
      "Receiving objects: 100% (226/226), 26.82 MiB | 4.63 MiB/s, done.\n",
      "Resolving deltas: 100% (87/87), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone git@github.com:andrewyng/translation-agent.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71021bb4-2590-4b74-9f0c-5de4b01f9fe7",
   "metadata": {},
   "source": [
    "## Andrew NG `Agentic Reasoning`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ee2efa9-d557-4e7e-82d0-35d4c08fdab4",
   "metadata": {},
   "source": [
    "> GPT-4 + Agent = GPT-5 （挺标题党的）\n",
    "\n",
    "- prompt engineering（LLM-based agents）\n",
    "    - modern：LLMs 重写一切；\n",
    "    - effective\n",
    "    - for engineer & research\n",
    "- 今年4月份的 Agentic 的演讲，6月份的 translation-agent（截止到目前4k的star） 的一个具体实践；\n",
    "    - https://github.com/andrewyng/translation-agent\n",
    "        - https://github.com/andrewyng/translation-agent/blob/main/src/translation_agent/utils.py\n",
    "- workflow\n",
    "    - 复杂任务的分解和抽象；\n",
    "        - step by steps 的完成一些相对较为简单的子任务要比 LLM 直出的完成一个复杂任务，更为简单而有效；\n",
    "    - 现实世界人类经验的镜像；\n",
    "- Agentic Reasoning design patterns\n",
    "    - **Reflection**\n",
    "    - Tool use\n",
    "    - Planning\n",
    "    - Multi-Agent Collaboration\n",
    "- 推荐下[《大模型应用开发 动手做AI Agent GPT大语言模型应用》](https://www.bilibili.com/opus/935785456083140628?spm_id_from=333.999.0.0)\n",
    "    - 面向开发者\n",
    "    - 系统而全面"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0998c989-d201-4046-9f89-6778342d7805",
   "metadata": {},
   "source": [
    "## prompt & workflow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9330bb7e-21b0-4bdb-b358-31ad66f4ba2d",
   "metadata": {},
   "source": [
    "\n",
    "- prompt: 具体的业务要求；\n",
    "- (agentic) workflow：对应着一种分解和抽象；\n",
    "\n",
    "```\n",
    "def translate(\n",
    "    source_lang,\n",
    "    target_lang,\n",
    "    source_text,\n",
    "    country,\n",
    "    max_tokens=MAX_TOKENS_PER_CHUNK,\n",
    "):\n",
    "    if ...\n",
    "        final_translation = one_chunk_translate_text(\n",
    "                source_lang, target_lang, source_text, country\n",
    "            )\n",
    "    \n",
    "        return final_translation\n",
    "    else:\n",
    "        source_text_chunks = text_splitter.split_text(source_text)\n",
    "\n",
    "        translation_2_chunks = multichunk_translation(\n",
    "            source_lang, target_lang, source_text_chunks, country\n",
    "        )\n",
    "\n",
    "        return \"\".join(translation_2_chunks)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc7d5dc9-9ebc-45cb-98c9-4e006c0103f9",
   "metadata": {},
   "source": [
    "### onechunk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c36bc83-5eb0-4f8f-ad19-dc6afec87113",
   "metadata": {},
   "source": [
    "\n",
    "- `one_chunk_initial_translation`\n",
    "- `one_chunk_reflect_on_translation`\n",
    "- `one_chunk_translate_text`\n",
    "\n",
    "    ```\n",
    "    def one_chunk_translate_text(\n",
    "        source_lang: str, target_lang: str, source_text: str, country: str = \"\"\n",
    "    ) -> str:\n",
    "        translation_1 = one_chunk_initial_translation(\n",
    "            source_lang, target_lang, source_text\n",
    "        )\n",
    "        \n",
    "        reflection = one_chunk_reflect_on_translation(\n",
    "            source_lang, target_lang, source_text, translation_1, country\n",
    "        )\n",
    "        translation_2 = one_chunk_improve_translation(\n",
    "            source_lang, target_lang, source_text, translation_1, reflection\n",
    "        )\n",
    "        return translation_2\n",
    "    ```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f193287a-5cdc-4d40-9421-4dbcd684a660",
   "metadata": {},
   "source": [
    "### multichunk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad2aeb86-8f8e-47f1-acb1-0c2de865b219",
   "metadata": {},
   "source": [
    "```\n",
    "def multichunk_translation(\n",
    "    source_lang, target_lang, source_text_chunks, country: str = \"\"\n",
    "):\n",
    "        translation_1_chunks = multichunk_initial_translation(\n",
    "            source_lang, target_lang, source_text_chunks\n",
    "        )\n",
    "    \n",
    "        reflection_chunks = multichunk_reflect_on_translation(\n",
    "            source_lang,\n",
    "            target_lang,\n",
    "            source_text_chunks,\n",
    "            translation_1_chunks,\n",
    "            country,\n",
    "        )\n",
    "    \n",
    "        translation_2_chunks = multichunk_improve_translation(\n",
    "            source_lang,\n",
    "            target_lang,\n",
    "            source_text_chunks,\n",
    "            translation_1_chunks,\n",
    "            reflection_chunks,\n",
    "        )\n",
    "    \n",
    "        return translation_2_chunks\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcbfb901-8743-4534-9e36-bf9bd155aaa8",
   "metadata": {},
   "source": [
    "- `split chunks；`\n",
    "- `from langchain_text_splitters import RecursiveCharacterTextSplitter`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c056b7da-875c-4e2b-a21e-898c6647fb60",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T11:48:24.074267Z",
     "iopub.status.busy": "2024-07-14T11:48:24.072985Z",
     "iopub.status.idle": "2024-07-14T11:48:24.085227Z",
     "shell.execute_reply": "2024-07-14T11:48:24.083940Z",
     "shell.execute_reply.started": "2024-07-14T11:48:24.074212Z"
    }
   },
   "outputs": [],
   "source": [
    "def calculate_chunk_size(token_count: int, token_limit: int) -> int:\n",
    "    \"\"\"\n",
    "    Calculate the chunk size based on the token count and token limit.\n",
    "\n",
    "    Args:\n",
    "        token_count (int): The total number of tokens.\n",
    "        token_limit (int): The maximum number of tokens allowed per chunk.\n",
    "\n",
    "    Returns:\n",
    "        int: The calculated chunk size.\n",
    "\n",
    "    Description:\n",
    "        This function calculates the chunk size based on the given token count and token limit.\n",
    "        If the token count is less than or equal to the token limit, the function returns the token count as the chunk size.\n",
    "        Otherwise, it calculates the number of chunks needed to accommodate all the tokens within the token limit.\n",
    "        The chunk size is determined by dividing the token limit by the number of chunks.\n",
    "        If there are remaining tokens after dividing the token count by the token limit,\n",
    "        the chunk size is adjusted by adding the remaining tokens divided by the number of chunks.\n",
    "\n",
    "    Example:\n",
    "        >>> calculate_chunk_size(1000, 500)\n",
    "        500\n",
    "        >>> calculate_chunk_size(1530, 500)\n",
    "        389\n",
    "        >>> calculate_chunk_size(2242, 500)\n",
    "        496\n",
    "    \"\"\"\n",
    "\n",
    "    if token_count <= token_limit:\n",
    "        return token_count\n",
    "\n",
    "    num_chunks = (token_count + token_limit - 1) // token_limit\n",
    "    chunk_size = token_count // num_chunks\n",
    "\n",
    "    remaining_tokens = token_count % token_limit\n",
    "    if remaining_tokens > 0:\n",
    "        chunk_size += remaining_tokens // num_chunks\n",
    "\n",
    "    return chunk_size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a01f7f83-90a9-4ae9-9cb2-be15e6a802a6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T11:55:56.468212Z",
     "iopub.status.busy": "2024-07-14T11:55:56.467646Z",
     "iopub.status.idle": "2024-07-14T11:55:56.477092Z",
     "shell.execute_reply": "2024-07-14T11:55:56.475730Z",
     "shell.execute_reply.started": "2024-07-14T11:55:56.468168Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 向上取整\n",
    "(1530 + 500-1) // 500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9d3554da-c693-4c22-a226-b0aff176c208",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T05:11:04.231853Z",
     "iopub.status.busy": "2024-07-14T05:11:04.231261Z",
     "iopub.status.idle": "2024-07-14T05:11:04.241210Z",
     "shell.execute_reply": "2024-07-14T05:11:04.239254Z",
     "shell.execute_reply.started": "2024-07-14T05:11:04.231808Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "382"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1530 // 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ab0eccfa-8139-4a37-885f-efe19b8acf43",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T05:11:13.608088Z",
     "iopub.status.busy": "2024-07-14T05:11:13.607472Z",
     "iopub.status.idle": "2024-07-14T05:11:13.614467Z",
     "shell.execute_reply": "2024-07-14T05:11:13.613609Z",
     "shell.execute_reply.started": "2024-07-14T05:11:13.608043Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "30"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1530 % 500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "85552e1d-d298-4d58-8ce5-8359386722ca",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T05:11:38.617749Z",
     "iopub.status.busy": "2024-07-14T05:11:38.616268Z",
     "iopub.status.idle": "2024-07-14T05:11:38.626539Z",
     "shell.execute_reply": "2024-07-14T05:11:38.625255Z",
     "shell.execute_reply.started": "2024-07-14T05:11:38.617704Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "389"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "382 + 30//4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7a09a9da-5c16-4247-9732-994c3b862c07",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T12:54:10.327428Z",
     "iopub.status.busy": "2024-07-14T12:54:10.326843Z",
     "iopub.status.idle": "2024-07-14T12:54:10.995924Z",
     "shell.execute_reply": "2024-07-14T12:54:10.995326Z",
     "shell.execute_reply.started": "2024-07-14T12:54:10.327381Z"
    }
   },
   "outputs": [],
   "source": [
    "from utils import translate\n",
    "\n",
    "source_lang = \"English\"\n",
    "target_lang = \"Chinese\"\n",
    "\n",
    "source_text = '''The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.\n",
    "Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].\n",
    "Recurrent models typically factor computation along the symbol positions of the input and output\n",
    "sequences  Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t−1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental\n",
    "constraint of sequential computation, however, remains.\n",
    "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.\n",
    "In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. \n",
    "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n",
    "[16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building\n",
    "block, computing hidden representations in parallel for all input and output positions. In these models,\n",
    "the number of operations required to relate signals from two arbitrary input or output positions grows\n",
    "in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\n",
    "it more difficult to learn dependencies between distant positions [12]. In the Transformer this is\n",
    "reduced to a constant number of operations, albeit at the cost of reduced effective resolution due\n",
    "to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\n",
    "described in section 3.2.\n",
    "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions\n",
    "of a single sequence in order to compute a representation of the sequence. Self-attention has been\n",
    "used successfully in a variety of tasks including reading comprehension, abstractive summarization,\n",
    "textual entailment and learning task-independent sentence representations [4, 27, 28, 22].\n",
    "End-to-end memory networks are based on a recurrent attention mechanism instead of sequencealigned recurrence and have been shown to perform well on simple-language question answering and\n",
    "language modeling tasks [34].\n",
    "To the best of our knowledge, however, the Transformer is the first transduction model relying\n",
    "entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].\n",
    "\n",
    "In this work, we presented the Transformer, the first sequence transduction model based entirely on\n",
    "attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\n",
    "multi-headed self-attention.\n",
    "For translation tasks, the Transformer can be trained significantly faster than architectures based\n",
    "on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014\n",
    "English-to-French translation tasks, we achieve a new state of the art. In the former task our best\n",
    "model outperforms even all previously reported ensembles.\n",
    "We are excited about the future of attention-based models and plan to apply them to other tasks. We\n",
    "plan to extend the Transformer to problems involving input and output modalities other than text and\n",
    "to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs\n",
    "such as images, audio and video. Making generation less sequential is another research goals of ours.\n",
    "The code we used to train and evaluate our models is available at https://github.com/\n",
    "tensorflow/tensor2tensor.\n",
    "Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful\n",
    "comments, corrections and inspiration.\n",
    "'''\n",
    "\n",
    "country = 'China'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8bafcb02-8115-495d-8688-299fb2203135",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T12:54:13.928361Z",
     "iopub.status.busy": "2024-07-14T12:54:13.928038Z",
     "iopub.status.idle": "2024-07-14T12:57:20.651892Z",
     "shell.execute_reply": "2024-07-14T12:57:20.649819Z",
     "shell.execute_reply.started": "2024-07-14T12:54:13.928341Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ic| num_tokens_in_text: 1166\n",
      "ic| 'Translating text as multiple chunks'\n",
      "ic| token_size: 666\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "initial translation: ['主导的序列转换模型基于复杂的循环或卷积神经网络，采用编码器-解码器配置。性能最佳的模型还通过注意力机制连接编码器和解码器。我们提出了一种新的简单网络架构，即Transformer，它完全基于注意力机制，彻底摒弃了循环和卷积。在两个机器翻译任务上的实验表明，这些模型在质量上更优越，同时具有更高的并行性，并且训练所需的时间大大减少。我们的模型在WMT 2014英语到德语翻译任务上达到了28.4 BLEU，比现有的最佳结果（包括集成模型）提高了超过2 BLEU。在WMT 2014英语到法语翻译任务上，我们的模型在使用八个GPU训练3.5天后，建立了新的单模型最佳BLEU得分41.8，这只是文献中最佳模型训练成本的一小部分。我们展示了Transformer在应用于英语成分句法分析等其他任务时的良好泛化能力，无论是在大量还是有限的训练数据上均取得了成功。\\n循环神经网络，特别是长短期记忆（LSTM）和门控循环神经网络，已被确认为序列建模和转换问题（如语言建模和机器翻译）中的最先进方法。此后，许多努力持续推动循环语言模型和编码器-解码器架构的发展边界。\\n循环模型通常沿着输入和输出序列的符号位置进行计算分解。通过将位置与计算时间的步骤对齐，它们生成一系列隐藏状态$h_t$，作为前一个隐藏状态$h_{t−1}$和位置$t$的输入的函数。这种固有的顺序性质阻碍了训练示例内的并行化，这在较长的序列长度时变得至关重要，因为内存限制限制了跨示例的批处理。最近的工作通过分解技巧和条件计算在计算效率上取得了显著改进，同时在后者的情况下也提高了模型性能。然而，顺序计算的基本限制仍然存在。\\n注意力机制已成为各种任务中引人注目的序列建模和转换模型的重要组成部分，允许在不考虑输入或输出序列中的距离的情况下建模依赖关系。然而，在少数几个案例中，这种注意力机制是与循环网络结合使用的。\\n在这项工作中，我们提出了Transformer，一种摒弃循环并完全依赖注意力机制来绘制输入和输出之间全局依赖关系的模型架构。Transformer允许显著更多的并行化，并且在经过短短十二小时的训练后就可以达到翻译质量的新艺术状态。', '在位置之间的距离上，对于ConvS2S是线性的，对于ByteNet是对数的。这使得学习远距离位置之间的依赖性变得更加困难[12]。在变压器中，这被减少到常数次操作，尽管由于平均加权位置的注意力，有效分辨率降低了，这一效应我们通过多头注意力进行了抵消，如第3.2节所述。\\n自注意力，有时被称为内部注意力，是一种关联单一序列不同位置以计算序列表示的注意力机制。自注意力已成功用于包括阅读理解、抽象总结、文本蕴含和学习任务独立的句子表示等多种任务[4, 27, 28, 22]。\\n端到端的记忆网络基于一种循环的注意力机制，而不是序列对齐的递归，已被证明在简单语言问答和语言建模任务中表现良好[34]。\\n据我们所知，变压器是第一个完全依赖自注意力来计算其输入和输出的表示，而不使用序列对齐的RNN或卷积的转导模型。在接下来的部分中，我们将描述变压器，激励自注意力，并讨论其相对于诸如[17, 18]和[9]等模型的优势。', '在这项工作中，我们介绍了Transformer，这是第一个完全基于注意力的序列转换模型，它用多头自注意力替换了编解码器架构中最常用的循环层。\\n\\n对于翻译任务，Transformer的训练速度比基于循环或卷积层的架构要快得多。在WMT 2014英德和WMT 2014英法翻译任务上，我们都达到了新的艺术状态。在前者任务中，我们的最佳模型甚至超过了以前报道的所有集合。\\n\\n我们对基于注意力的模型的未来感到兴奋，并计划将它们应用于其他任务。我们计划将Transformer扩展到涉及非文本的输入和输出模态的问题，并调查局部的、受限的注意力机制，以有效处理大型输入和输出，如图像、音频和视频。使生成过程更少序列化是我们的另一个研究目标。\\n\\n我们用于训练和评估模型的代码可在https://github.com/tensorflow/tensor2tensor上找到。\\n\\n致谢 我们感谢Nal Kalchbrenner和Stephan Gouws对他们的有益评论、更正和灵感表示感谢。']\n",
      "reflection: ['1. **Accuracy and Terminology**: The phrase \"性能最佳的模型还通过注意力机制连接编码器和解码器\" can be improved for clarity and accuracy. Consider revising to \"表现最佳的模型还通过注意力机制将编码器与解码器连接起来\". This change ensures that the translation accurately reflects the \"best performing\" aspect and maintains the correct technical terminology.\\n\\n2. **Fluency and Style**: In the sentence \"我们的模型在WMT 2014英语到德语翻译任务上达到了28.4 BLEU，比现有的最佳结果（包括集成模型）提高了超过2 BLEU\", consider revising to \"我们的模型在WMT 2014英语到德语翻译任务上取得了28.4的BLEU分数，较现有最佳成绩（包括集成模型）高出2分以上\". This revision enhances fluency by using a more natural numerical expression in Chinese and improves the flow of the sentence.\\n\\n3. **Terminology Consistency**: The term \"并行性\" is used to translate \"parallelizable\". It would be more accurate and consistent to use \"可并行化\" throughout the translation to ensure that the terminology aligns with technical standards in Chinese.\\n\\n4. **Style and Fluency**: The translation \"这只是文献中最佳模型训练成本的一小部分\" could be more fluently expressed as \"这仅是文献中顶尖模型训练成本的一小部分\". This adjustment enhances the style by better matching the comparative excellence implied in the source text.\\n\\n5. **Accuracy and Style**: The sentence \"我们展示了Transformer在应用于英语成分句法分析等其他任务时的良好泛化能力，无论是在大量还是有限的训练数据上均取得了成功\" might be better translated as \"我们证明了Transformer在应用于英语成分句法分析等其他任务上的出色泛化能力，无论是在大规模还是有限的训练数据上均表现优异\". This revision corrects the tone to better reflect the strong endorsement of the Transformer\\'s capabilities and uses more precise adjectives to describe the performance.\\n\\n6. **Fluency**: The phrase \"一种摒弃循环并完全依赖注意力机制来绘制输入和输出之间全局依赖关系的模型架构\" can be made more fluent by rephrasing to \"一种抛弃循环结构，完全依赖注意力机制来构建输入与输出之间全局依赖的模型架构\". This adjustment improves readability and maintains the technical accuracy of the description.\\n\\n7. **Terminology and Style**: In the translation \"并且在经过短短十二小时的训练后就可以达到翻译质量的新艺术状态\", consider revising to \"仅需短短十二小时训练，便可在翻译质量上达到新的艺术水平\". This change not only corrects the phrase to better convey the meaning of achieving a new state of the art but also aligns with a more natural expression in Chinese.', '1. **Terminology Consistency**: The term \"Transformer\" should be translated consistently. In the provided translation, it is translated as \"变压器\" which actually means \"electrical transformer\". The correct translation in the context of machine learning should be \"变换器\" to avoid confusion with electrical devices.\\n\\n2. **Technical Accuracy**: The phrase \"linearly for ConvS2S and logarithmically for ByteNet\" is translated correctly in terms of the words \"linearly\" and \"logarithmically\". However, consider rephrasing to \"对于ConvS2S是线性增长，对于ByteNet是对数增长\" to enhance clarity that it refers to the growth rate, not just a linear or logarithmic relationship.\\n\\n3. **Clarification and Fluency**: The phrase \"这使得学习远距离位置之间的依赖性变得更加困难\" could be made more fluent and clear. Consider revising to \"这样一来，学习位置间远距离的依赖关系变得更为困难\" to improve the flow and readability of the sentence.\\n\\n4. **Technical Detail**: In the sentence about Multi-Head Attention, it\\'s translated as \"这一效应我们通过多头注意力进行了抵消\". To clarify the mechanism and maintain technical accuracy, consider \"我们通过使用多头注意力机制来抵消这一影响\" to explicitly state the use of the mechanism.\\n\\n5. **Consistency in Technical Terms**: The term \"self-attention\" is translated as \"自注意力\", which is correct, but \"intra-attention\" is translated as \"内部注意力\". To maintain consistency and avoid confusion, consider using \"自注意力\" for both terms since they refer to the same concept.\\n\\n6. **Fluency and Style**: The translation of \"End-to-end memory networks are based on a recurrent attention mechanism\" is a bit choppy. Consider revising to \"端到端记忆网络是基于一种循环注意机制构建的\" for better fluency and to match the style of the source text.\\n\\n7. **Contextual Accuracy**: The final sentence \"在接下来的部分中，我们将描述变压器，激励自注意力，并讨论其相对于诸如[17, 18]和[9]等模型的优势。\" could be more accurately translated to reflect the future discussion in the text. Consider \"在后续章节中，我们将详细介绍变换器，阐述自注意力的动机，并探讨它与其他模型（如[17, 18]和[9]）相比的优势。\" This maintains the anticipatory nature of the original text and corrects the name of the Transformer.\\n\\n8. **Cultural and Stylistic Adaptation**: Ensure that the translation uses formal language and technical terminology consistently, as the source text is academic and technical. Avoid colloquial expressions and ensure that the translation maintains a formal tone throughout.', '1. **Accuracy and Terminology**: In the sentence \"我们都达到了新的艺术状态\", the phrase \"新的艺术状态\" is a literal translation of \"a new state of the art\" which might not be clear in Chinese. It could be improved to \"我们都达到了新的最高水平\" or \"我们都创造了新的行业标准\", which more accurately conveys the meaning of setting a new benchmark or standard in the field.\\n\\n2. **Fluency and Style**: The phrase \"我们的最佳模型甚至超过了以前报道的所有集合\" uses \"所有集合\" to translate \"all previously reported ensembles\". The term \"集合\" might be confusing as it typically refers to a mathematical set. A more fluent translation could be \"我们的最佳模型甚至超过了此前所有报道过的模型组合\".\\n\\n3. **Terminology**: The term \"多头自注意力\" is a direct translation of \"multi-headed self-attention\". While technically correct, it could be enhanced for clarity by adding a brief explanatory phrase, especially for readers unfamiliar with the concept. Consider revising to \"多头自注意力（一种处理多种关系的注意力机制）\".\\n\\n4. **Style and Fluency**: The sentence \"使生成过程更少序列化是我们的另一个研究目标\" could be rephrased for better readability and style. A more natural expression might be \"我们还致力于研究如何减少生成过程的序列依赖性\".\\n\\n5. **Accuracy**: The translation \"我们感谢Nal Kalchbrenner和Stephan Gouws对他们的有益评论、更正和灵感表示感谢\" repeats \"表示感谢\" which is redundant. It should be simplified to \"我们感谢Nal Kalchbrenner和Stephan Gouws对他们的有益评论、更正和灵感\".\\n\\n6. **Fluency**: The URL \"https://github.com/tensorflow/tensor2tensor\" could be better integrated into the sentence. Consider adjusting the structure to \"我们用于训练和评估模型的代码已发布在GitHub上，可通过以下链接访问：https://github.com/tensorflow/tensor2tensor。\" This provides a smoother reading experience.\\n\\n7. **Terminology and Style**: In \"调查局部的、受限的注意力机制\", the phrase \"调查\" might be better translated as \"研究\" to match the academic and technical context of the source text. Additionally, \"局部的、受限的\" could be expanded for clarity, perhaps as \"局部的且受限制的（针对特定区域的）\".\\n\\n8. **Fluency**: The structure of \"我们对基于注意力的模型的未来感到兴奋，并计划将它们应用于其他任务\" could be improved for better flow. A more natural phrasing might be \"我们对基于注意力模型的未来发展感到兴奋，并计划将其应用到更多任务中\". \\n\\nBy addressing these points, the translation can be made more accurate, fluent, and stylistically appropriate for the intended audience.']\n",
      "final translation: ['主导的序列转换模型基于复杂的循环或卷积神经网络，采用编码器-解码器配置。表现最佳的模型还通过注意力机制将编码器与解码器连接起来。我们提出了一种新的简单网络架构，即Transformer，它完全基于注意力机制，彻底摒弃了循环和卷积。在两个机器翻译任务上的实验表明，这些模型在质量上更优越，同时具有更高的可并行化性，并且训练所需的时间大大减少。我们的模型在WMT 2014英语到德语翻译任务上取得了28.4的BLEU分数，较现有最佳成绩（包括集成模型）高出2分以上。在WMT 2014英语到法语翻译任务上，我们的模型在使用八个GPU训练3.5天后，建立了新的单模型最佳BLEU得分41.8，这仅是文献中顶尖模型训练成本的一小部分。我们证明了Transformer在应用于英语成分句法分析等其他任务上的出色泛化能力，无论是在大规模还是有限的训练数据上均表现优异。\\n循环神经网络，特别是长短期记忆（LSTM）和门控循环神经网络，已被确认为序列建模和转换问题（如语言建模和机器翻译）中的最先进方法。此后，许多努力持续推动循环语言模型和编码器-解码器架构的发展边界。\\n循环模型通常沿着输入和输出序列的符号位置进行计算分解。通过将位置与计算时间的步骤对齐，它们生成一系列隐藏状态$h_t$，作为前一个隐藏状态$h_{t−1}$和位置$t$的输入的函数。这种固有的顺序性质阻碍了训练示例内的并行化，这在较长的序列长度时变得至关重要，因为内存限制限制了跨示例的批处理。最近的工作通过分解技巧和条件计算在计算效率上取得了显著改进，同时在后者的情况下也提高了模型性能。然而，顺序计算的基本限制仍然存在。\\n注意力机制已成为各种任务中引人注目的序列建模和转换模型的重要组成部分，允许在不考虑输入或输出序列中的距离的情况下建模依赖关系。然而，在少数几个案例中，这种注意力机制是与循环网络结合使用的。\\n在这项工作中，我们提出了Transformer，一种抛弃循环结构，完全依赖注意力机制来构建输入与输出之间全局依赖的模型架构。Transformer允许显著更多的并行化，并且仅需短短十二小时训练，便可在翻译质量上达到新的艺术水平。', '在位置之间的距离上，对于ConvS2S是线性增长，对于ByteNet是对数增长。这样一来，学习位置间远距离的依赖关系变得更为困难[12]。在变换器中，这被减少到常数次操作，尽管由于平均加权位置的注意力，有效分辨率降低了，我们通过使用多头注意力机制来抵消这一影响，如第3.2节所述。\\n自注意力，有时也称为自内部注意力，是一种关联单一序列不同位置以计算序列表示的注意力机制。自注意力已成功用于包括阅读理解、抽象总结、文本蕴含和学习任务独立的句子表示等多种任务[4, 27, 28, 22]。\\n端到端记忆网络是基于一种循环注意机制构建的，而不是序列对齐的递归，已被证明在简单语言问答和语言建模任务中表现良好[34]。\\n据我们所知，变换器是第一个完全依赖自注意力来计算其输入和输出的表示，而不使用序列对齐的RNN或卷积的转导模型。在后续章节中，我们将详细介绍变换器，阐述自注意力的动机，并探讨它与其他模型（如[17, 18]和[9]）相比的优势。', '在这项工作中，我们介绍了Transformer，这是第一个完全基于注意力的序列转换模型，它采用了多头自注意力（一种处理多种关系的注意力机制），取代了编解码器架构中通常使用的循环层。\\n\\n在翻译任务中，Transformer的训练速度远快于基于循环或卷积层的架构。在WMT 2014英德和WMT 2014英法翻译任务上，我们都创造了新的行业标准。在前者任务中，我们的最佳模型甚至超过了此前所有报道过的模型组合。\\n\\n我们对基于注意力模型的未来发展感到兴奋，并计划将其应用到更多任务中。我们计划将Transformer扩展到涉及非文本的输入和输出模态的问题，并研究局部的且受限制的（针对特定区域的）注意力机制，以有效处理大型输入和输出，如图像、音频和视频。我们还致力于研究如何减少生成过程的序列依赖性。\\n\\n我们用于训练和评估模型的代码已发布在GitHub上，可通过以下链接访问：https://github.com/tensorflow/tensor2tensor。\\n\\n致谢 我们感谢Nal Kalchbrenner和Stephan Gouws对他们的有益评论、更正和灵感。']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'主导的序列转换模型基于复杂的循环或卷积神经网络，采用编码器-解码器配置。表现最佳的模型还通过注意力机制将编码器与解码器连接起来。我们提出了一种新的简单网络架构，即Transformer，它完全基于注意力机制，彻底摒弃了循环和卷积。在两个机器翻译任务上的实验表明，这些模型在质量上更优越，同时具有更高的可并行化性，并且训练所需的时间大大减少。我们的模型在WMT 2014英语到德语翻译任务上取得了28.4的BLEU分数，较现有最佳成绩（包括集成模型）高出2分以上。在WMT 2014英语到法语翻译任务上，我们的模型在使用八个GPU训练3.5天后，建立了新的单模型最佳BLEU得分41.8，这仅是文献中顶尖模型训练成本的一小部分。我们证明了Transformer在应用于英语成分句法分析等其他任务上的出色泛化能力，无论是在大规模还是有限的训练数据上均表现优异。\\n循环神经网络，特别是长短期记忆（LSTM）和门控循环神经网络，已被确认为序列建模和转换问题（如语言建模和机器翻译）中的最先进方法。此后，许多努力持续推动循环语言模型和编码器-解码器架构的发展边界。\\n循环模型通常沿着输入和输出序列的符号位置进行计算分解。通过将位置与计算时间的步骤对齐，它们生成一系列隐藏状态$h_t$，作为前一个隐藏状态$h_{t−1}$和位置$t$的输入的函数。这种固有的顺序性质阻碍了训练示例内的并行化，这在较长的序列长度时变得至关重要，因为内存限制限制了跨示例的批处理。最近的工作通过分解技巧和条件计算在计算效率上取得了显著改进，同时在后者的情况下也提高了模型性能。然而，顺序计算的基本限制仍然存在。\\n注意力机制已成为各种任务中引人注目的序列建模和转换模型的重要组成部分，允许在不考虑输入或输出序列中的距离的情况下建模依赖关系。然而，在少数几个案例中，这种注意力机制是与循环网络结合使用的。\\n在这项工作中，我们提出了Transformer，一种抛弃循环结构，完全依赖注意力机制来构建输入与输出之间全局依赖的模型架构。Transformer允许显著更多的并行化，并且仅需短短十二小时训练，便可在翻译质量上达到新的艺术水平。在位置之间的距离上，对于ConvS2S是线性增长，对于ByteNet是对数增长。这样一来，学习位置间远距离的依赖关系变得更为困难[12]。在变换器中，这被减少到常数次操作，尽管由于平均加权位置的注意力，有效分辨率降低了，我们通过使用多头注意力机制来抵消这一影响，如第3.2节所述。\\n自注意力，有时也称为自内部注意力，是一种关联单一序列不同位置以计算序列表示的注意力机制。自注意力已成功用于包括阅读理解、抽象总结、文本蕴含和学习任务独立的句子表示等多种任务[4, 27, 28, 22]。\\n端到端记忆网络是基于一种循环注意机制构建的，而不是序列对齐的递归，已被证明在简单语言问答和语言建模任务中表现良好[34]。\\n据我们所知，变换器是第一个完全依赖自注意力来计算其输入和输出的表示，而不使用序列对齐的RNN或卷积的转导模型。在后续章节中，我们将详细介绍变换器，阐述自注意力的动机，并探讨它与其他模型（如[17, 18]和[9]）相比的优势。在这项工作中，我们介绍了Transformer，这是第一个完全基于注意力的序列转换模型，它采用了多头自注意力（一种处理多种关系的注意力机制），取代了编解码器架构中通常使用的循环层。\\n\\n在翻译任务中，Transformer的训练速度远快于基于循环或卷积层的架构。在WMT 2014英德和WMT 2014英法翻译任务上，我们都创造了新的行业标准。在前者任务中，我们的最佳模型甚至超过了此前所有报道过的模型组合。\\n\\n我们对基于注意力模型的未来发展感到兴奋，并计划将其应用到更多任务中。我们计划将Transformer扩展到涉及非文本的输入和输出模态的问题，并研究局部的且受限制的（针对特定区域的）注意力机制，以有效处理大型输入和输出，如图像、音频和视频。我们还致力于研究如何减少生成过程的序列依赖性。\\n\\n我们用于训练和评估模型的代码已发布在GitHub上，可通过以下链接访问：https://github.com/tensorflow/tensor2tensor。\\n\\n致谢 我们感谢Nal Kalchbrenner和Stephan Gouws对他们的有益评论、更正和灵感。'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "translate(source_lang, target_lang, source_text, country)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
